Abstract



This paper studies sequence prediction based on the monotone Kolmogorov
complexity $Km = -\log m$, i.e. based on universal deterministic/one-part MDL.
m is extremely close to Solomonoff's prior M, the latter being an excellent 
predictor in deterministic as well as probabilistic environments, where perfor- 
mance is measured in terms of convergence of posteriors or losses. Despite this 
closeness to M, it is difficult to assess the prediction quality of m, since little is 
known about the closeness of their posteriors, which are the important quanti- 
ties for prediction. We show that for deterministic computable environments, 
the "posterior" and losses of m converge, but rapid convergence could only 
be shown on-sequence; the off-sequence behavior is unclear. In probabilistic 
environments, neither the posterior nor the losses converge, in general. 
1 Introduction



Complexity based sequence prediction. In this work we study the performance
of Occam's razor based sequence predictors. Given a data sequence
$x_1, x_2, ..., x_{n-1}$ we want to predict (certain characteristics of) the next
data item $x_n$. Every $x_t$ is an element of some domain $\mathcal{X}$, for
instance weather data or stock-market data at time t, or the t-th digit of $\pi$.
Occam's razor [LV97], appropriately interpreted, tells us to search for the
simplest explanation (model) of our data $x_1,...,x_{n-1}$ and to use this model
for predicting $x_n$. Simplicity, or more precisely, effective complexity can be
measured by the length of the shortest program computing sequence
$x := x_1...x_{n-1}$. This length is called the algorithmic information content
of x, which we denote by $\tilde K(x)$. $\tilde K$ stands for one of the many
variants of "Kolmogorov" complexity (plain, prefix, monotone, ...) or for
$-\log\tilde k(x)$ of universal distributions/measures $\tilde k(x)$. For
simplicity we only consider binary alphabet $\mathcal{X} = \{0,1\}$ in this work.

The most well-studied complexity regarding its predictive properties is
$KM(x) = -\log M(x)$, where $M(x)$ is Solomonoff's universal prior [Sol64, Sol78].
Solomonoff has shown that the posterior $M(x_t|x_1...x_{t-1})$ rapidly converges
to the true data generating distribution. In [Hut01b, Hut02] it has been shown
that M is also an excellent predictor from a decision-theoretic point of view,
where the goal is to minimize loss. In any case, for prediction, the posterior
$M(x_t|x_1...x_{t-1})$, rather than the prior $M(x_{1:t})$, is the more important
quantity.

Most complexities $\tilde K$ coincide within an additive logarithmic term, which
implies that their "priors" $\tilde k = 2^{-\tilde K}$ are close within polynomial
accuracy. Some of them are extremely close to each other. Many papers deal with
the proximity of various complexity measures [Lev73, Gac83, ...]. Closeness of two
complexity measures is regarded as indication that the quality of their prediction
is similarly good [LV97, p. 334]. On the other hand, besides M, little is really
known about the closeness of "posteriors", relevant for prediction.

Aim and conclusion. The main aim of this work is to study the predictive prop- 
erties of complexity measures, other than KM. The monotone complexity Km is, 
in a sense, closest to Solomonoff's complexity KM. While KM is defined via a mix- 
ture of infinitely many programs, the conceptually simpler Km approximates KM 
by the contribution of the single shortest program. This is also closer to the spirit 
of Occam's razor. Km is a universal deterministic/one-part version of the
popular Minimum Description Length (MDL) principle. We mainly concentrate on Km
because it has a direct interpretation as a universal deterministic/one-part MDL
predictor, and it is closest to the excellent performing KM, so we expect
predictions based on other $\tilde K$ not to be better.

The main conclusion we will draw is that closeness of priors necessarily implies
neither closeness of posteriors nor good performance from a decision-theoretic
perspective. It is far from obvious whether Km is a good predictor in general, and
indeed we show that Km can fail (with probability strictly greater than zero) in the
presence of noise, as opposed to KM. We do not suggest that Km fails for sequences
occurring in practice. It is not implausible that (from a practical point of view)
minor extra (apart from complexity) assumptions on the environment or loss function
are sufficient to prove good performance of Km. Some complexity measures, like K,
fail completely for prediction.

Contents. Section 2 introduces notation and describes how prediction performance
is measured in terms of convergence of posteriors or losses. Section 3 summarizes
known predictive properties of Solomonoff's prior M. Section 4 introduces the
monotone complexity Km and the prefix complexity K and describes how they and
other complexity measures can be used for prediction. In Section 5 we enumerate
and relate eight important properties, which general predictive functions may
possess or not: proximity to M, universality, monotonicity, being a semimeasure,
the chain rule, enumerability, convergence, and self-optimizingness. Some later
needed normalization issues are also discussed. Section 6 contains our main
results. Monotone complexity Km is analyzed quantitatively w.r.t. the eight
predictive properties. Qualitatively, for deterministic, computable environments,
the posterior converges and is self-optimizing, but rapid convergence could only
be shown on-sequence; the (for prediction equally important) off-sequence behavior
is unclear. In probabilistic environments, m neither converges, nor is it
self-optimizing, in general. The proofs are presented in Section 7. Section 8
contains an outlook and a list of open questions.



2 Notation and Setup 

Strings and natural numbers. We write $\mathcal{X}^*$ for the set of finite
strings over binary alphabet $\mathcal{X} = \{0,1\}$, and $\mathcal{X}^\infty$ for
the set of infinite sequences. We use letters i,t,n for natural numbers, x,y,z for
finite strings, $\epsilon$ for the empty string, $l(x)$ for the length of string x,
and $\omega = x_{1:\infty}$ for infinite sequences. We write xy for the
concatenation of string x with y. For a string of length n we write
$x_1x_2...x_n$ with $x_t\in\mathcal{X}$ and further abbreviate
$x_{1:n} := x_1x_2...x_{n-1}x_n$ and $x_{<n} := x_1...x_{n-1}$. For a given
sequence $x_{1:\infty}$ we say that $x_t$ is on-sequence and $\bar x_t \neq x_t$ is
off-sequence. $x'_t$ may be on- or off-sequence.

Prefix sets/codes. String x is called a (proper) prefix of y if there is a
$z\,(\neq\epsilon)$ such that xz = y. We write x* = y in this case, where * is a
wildcard for a string, and similarly for infinite sequences. A set of strings is
called prefix-free if no element is a proper prefix of another. A prefix-free set
$\mathcal{P}$ is also called a prefix-code. Prefix-codes have the important
property of satisfying Kraft's inequality $\sum_{x\in\mathcal{P}} 2^{-l(x)} \le 1$.
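
As a quick numerical illustration (a sketch, not part of the original development), prefix-freeness and Kraft's inequality can be checked for a small hand-picked code. The helper names `is_prefix_free` and `kraft_sum` and the two example codes are assumptions introduced here.

```python
from fractions import Fraction
from itertools import permutations


def is_prefix_free(code):
    """True if no code word is a proper prefix of another code word."""
    return not any(a != b and b.startswith(a) for a, b in permutations(code, 2))


def kraft_sum(code):
    """Sum of 2^(-l(x)) over all code words x, as an exact fraction."""
    return sum(Fraction(1, 2 ** len(x)) for x in code)


# Hypothetical example codes over the binary alphabet {0,1}.
prefix_code = ["0", "10", "110", "111"]
not_prefix_code = ["0", "01", "11"]   # "0" is a proper prefix of "01"

print(is_prefix_free(prefix_code), kraft_sum(prefix_code))
print(is_prefix_free(not_prefix_code), kraft_sum(not_prefix_code))
```

The second code shows that the converse fails: a set of strings can satisfy the inequality without being prefix-free.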

Asymptotic notation. We abbreviate $\lim_{t\to\infty}[f(t)-g(t)] = 0$ by
$f(t) \xrightarrow{t\to\infty} g(t)$ and say f converges to g, without implying
that $\lim_{t\to\infty} g(t)$ itself exists. We write
$f(x) \stackrel{\times}{\le} g(x)$ for $f(x) = O(g(x))$ and
$f(x) \stackrel{+}{\le} g(x)$ for $f(x) \le g(x) + O(1)$. Corresponding
equalities can be defined similarly. They hold if the corresponding inequalities
hold in both directions. $\sum_{t=1}^\infty a_t^2 < \infty$ implies
$a_t \xrightarrow{t\to\infty} 0$. We say that $a_t$ converges fast or rapidly to
zero if $\sum_{t=1}^\infty a_t^2 \le c$, where c is a constant of reasonable size;
c = 100 is reasonable, maybe even $c = 2^{30}$, but $c = 2^{500}$ is not.$^1$ The
number of times for which $a_t$ deviates from 0 by more than $\varepsilon$ is
finite and bounded by $c/\varepsilon^2$; no statement is possible for which t
these deviations occur. The cardinality of a set $\mathcal{S}$ is denoted by
$|\mathcal{S}|$ or $\#\mathcal{S}$.

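(Semi)measures. We call $\rho : \mathcal{X}^* \to [0,1]$ a (semi)measure iff
$\sum_{x_n\in\mathcal{X}} \rho(x_{1:n}) \stackrel{(<)}{=} \rho(x_{<n})$ and
$\rho(\epsilon) \stackrel{(<)}{=} 1$, i.e. with equality for a measure and with
$\le$ for a semimeasure. $\rho(x)$ is interpreted as the $\rho$-probability of
sampling a sequence which starts with x. The conditional probability (posterior)

$\rho(x_t|x_{<t}) := \frac{\rho(x_{1:t})}{\rho(x_{<t})}$    (1)

is the $\rho$-probability that a string $x_1...x_{t-1}$ is followed by (continued
with) $x_t$. We call $\rho$ deterministic if
$\exists\omega : \rho(\omega_{1:n}) = 1\ \forall n$. In this case we identify
$\rho$ with $\omega$.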
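
A minimal sketch of these definitions for one concrete case, assuming an i.i.d. Bernoulli measure with parameter 1/3. The names `rho`, `posterior`, and `THETA`, and the parameter value, are illustrative assumptions and do not appear in the original.

```python
from fractions import Fraction

THETA = Fraction(1, 3)  # assumed probability of the symbol "1"


def rho(x):
    """rho(x): probability that a sampled sequence starts with the string x."""
    ones = x.count("1")
    return THETA ** ones * (1 - THETA) ** (len(x) - ones)


def posterior(xt, past):
    """rho(x_t | x_<t) := rho(x_<t x_t) / rho(x_<t), as in definition (1)."""
    return rho(past + xt) / rho(past)


past = "0110"
# Measure property: the one-symbol extensions sum exactly to rho(past).
print(rho(past + "0") + rho(past + "1") == rho(past), rho("") == 1)
print(posterior("1", past), posterior("0", past))
```

For an i.i.d. measure the posterior does not depend on the past, which makes the chain rule easy to verify by hand.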

Convergent predictors. We assume that $\mu$ is a "true"$^2$ sequence generating
measure, also called environment. If we know the generating process $\mu$, and
given past data $x_{<t}$, we can predict the probability $\mu(x_t|x_{<t})$ of the
next data item $x_t$. Usually we do not know $\mu$, but estimate it from
$x_{<t}$. Let $\rho(x_t|x_{<t})$ be an estimated probability$^3$ of $x_t$, given
$x_{<t}$. Closeness of $\rho(x_t|x_{<t})$ to $\mu(x_t|x_{<t})$ is expected to lead
to "good" predictions:

Consider, for instance, a weather data sequence $x_{1:n}$ with $x_t = 1$ meaning
rain and $x_t = 0$ meaning sun at day t. Given $x_{<t}$ the probability of rain
tomorrow is $\mu(1|x_{<t})$. A weather forecaster may announce the probability of
rain to be $y_t := \rho(1|x_{<t})$, which should be close to the true probability
$\mu(1|x_{<t})$. To aim for

$\rho(x'_t|x_{<t}) \longrightarrow \mu(x'_t|x_{<t})$  for  $t\to\infty$    (2)

seems reasonable. A sequence of random variables $z_t = z_t(\omega)$ (like
$z_t = \rho(x_t|x_{<t}) - \mu(x_t|x_{<t})$) is said to converge to zero with
$\mu$-probability 1 (w.p.1) if the set $\{\omega : z_t(\omega)
\xrightarrow{t\to\infty} 0\}$ has $\mu$-measure 1. $z_t$ is said to converge to
zero in mean sum (i.m.s.) if $\sum_{t=1}^\infty E[z_t^2] \le c < \infty$, where E
denotes $\mu$-expectation. Convergence i.m.s. implies convergence w.p.1 (rapid if
c is of reasonable size).

Depending on the interpretation, a $\rho$ satisfying (2) could be called consistent
or self-tuning [KV86]. One problem with using (2) as performance measure is that
closeness cannot be computed, since $\mu$ is unknown. Another disadvantage is that
(2) does not take into account the value of correct predictions or the severity of
wrong predictions.

1 Environments of interest have reasonable complexity K, but $2^K$ is not of reasonable size.

2 Also called objective or aleatory probability or chance.

3 Also called subjective or belief or epistemic probability.

Self-optimizing predictors. More practical and flexible is a decision-theoretic
approach, where performance is measured w.r.t. the true outcome sequence $x_{1:n}$
by means of a loss function, for instance $\ell_{x_ty_t} := (x_t - y_t)^2$, which
does not involve $\mu$. More generally, let $\ell_{x_ty_t} \in [0,1] \subset
\mathbb{R}$ be the received loss when performing some prediction/decision/action
$y_t\in\mathcal{Y}$ and $x_t\in\mathcal{X}$ is the t-th symbol of the sequence.
Let $y_t^\Lambda\in\mathcal{Y}$ be the prediction of a (causal) prediction scheme
$\Lambda$. The true probability of the next symbol being $x_t$, given $x_{<t}$, is
$\mu(x_t|x_{<t})$. The $\mu$-expected loss (given $x_{<t}$) when $\Lambda$
predicts the t-th symbol is

$l_t^\Lambda(x_{<t}) := \sum_{x_t} \mu(x_t|x_{<t})\,\ell_{x_ty_t^\Lambda}.$

The goal is to minimize the $\mu$-expected loss. More generally, we define the
$\Lambda_\rho$ sequence prediction scheme

$y_t^{\Lambda_\rho} := \arg\min_{y_t\in\mathcal{Y}} \sum_{x_t} \rho(x_t|x_{<t})\,\ell_{x_ty_t},$    (3)

which minimizes the $\rho$-expected loss. If $\mu$ is known, $\Lambda_\mu$ is
obviously the best prediction scheme in the sense of achieving minimal expected
loss ($l_t^{\Lambda_\mu} \le l_t^\Lambda$ for all $\Lambda$). An important special
case is the error-loss $\ell_{xy} = 1 - \delta_{xy}$ with
$\mathcal{Y} = \mathcal{X}$. In this case $\Lambda_\rho$ predicts the $y_t$ which
maximizes $\rho(y_t|x_{<t})$, and $\sum_t E[l_t^{\Lambda_\rho}]$ is the expected
number of prediction errors (where $y_t^{\Lambda_\rho} \neq x_t$). The natural
decision-theoretic counterpart of (2) is to aim for

$l_t^{\Lambda_\rho}(x_{<t}) \stackrel{(fast)}{\longrightarrow} l_t^{\Lambda_\mu}(x_{<t})$  for  $t\to\infty$    (4)

what is called (without the fast supplement) self-optimizingness in control theory.
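
The prediction scheme (3) and the loss comparison behind (4) can be sketched for a single time step as follows. This is only an illustration: the probability tables `mu` and `rho` are made-up values, and the function names `expected_loss` and `predict` are assumptions introduced here.

```python
from fractions import Fraction


def expected_loss(prob, loss, y):
    """Expected loss of action y: sum_x prob[x] * loss[x][y]."""
    return sum(p * loss[x][y] for x, p in prob.items())


def predict(prob, loss, actions):
    """Action minimizing the prob-expected loss, as in (3).

    Ties are broken in favour of the first action in `actions`.
    """
    return min(actions, key=lambda y: expected_loss(prob, loss, y))


symbols = [0, 1]
# Error loss l_xy = 1 - delta_xy with Y = X.
error_loss = {x: {y: 0 if x == y else 1 for y in symbols} for x in symbols}

mu = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # assumed true next-symbol probabilities
rho = {0: Fraction(2, 3), 1: Fraction(1, 3)}  # assumed (poor) estimate

y_mu = predict(mu, error_loss, symbols)
y_rho = predict(rho, error_loss, symbols)
# Both schemes are evaluated under the true probabilities mu.
print(y_mu, expected_loss(mu, error_loss, y_mu))
print(y_rho, expected_loss(mu, error_loss, y_rho))
```

With these numbers the informed scheme predicts 1 and incurs expected loss 1/4, while the scheme based on the poor estimate predicts 0 and incurs 3/4.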



3 Predictive Properties of $M = 2^{-KM}$

We define a prefix Turing machine T as a Turing machine with binary unidirectional
input and output tapes, and some bidirectional work tapes. We say T halts on input
p with output x and write "T(p) = x halts" if p is to the left of the input head
and x is to the left of the output head after T halts. The set of p on which T
halts forms a prefix-code. We call such codes p self-delimiting programs. We write
T(p) = x* if T outputs a string starting with x; T need not halt in this case. p is
called minimal if $T(q) \neq x*$ for all proper prefixes q of p. The set of all
prefix Turing machines $\{T_1, T_2, ...\}$ can be effectively enumerated. There
exists a universal prefix Turing machine U which can simulate every $T_i$. A
function is called computable if there is a Turing machine which computes it. A
function is called enumerable if it can be approximated from below. Let
$\mathcal{M}^{msr}_{comp}$ be the set of all computable measures,
$\mathcal{M}^{semi}_{enum}$ the set of all enumerable semimeasures, and
$\mathcal{M}_{det}$ be the set of all deterministic measures
($\hat{=}\,\mathcal{X}^\infty$).$^4$

4 $\mathcal{M}^{semi}_{enum}$ is enumerable, but $\mathcal{M}^{msr}_{comp}$ is not, and $\mathcal{M}_{det}$ is uncountable.

Levin [ZL70, LV97] has shown the existence of an enumerable universal semimeasure
M ($M \stackrel{\times}{\ge} \nu\ \forall\nu\in\mathcal{M}^{semi}_{enum}$). An
explicit expression due to Solomonoff [Sol78] is

$M(x) := \sum_{p\,:\,U(p)=x*} 2^{-l(p)}, \qquad KM(x) := -\log M(x).$    (5)

The sum is over all (possibly non-halting) minimal programs p which output a string
starting with x. This definition is equivalent to the probability that U outputs a
string starting with x if provided with fair coin flips on the input tape. M can
be used to characterize randomness of individual sequences: A sequence
$x_{1:\infty}$ is (Martin-Löf) $\mu$-random iff
$\exists c : M(x_{1:n}) \le c\cdot\mu(x_{1:n})\ \forall n$. For later comparison,
we summarize the (excellent) predictive properties of M [Sol78, Hut01a, Hut02] (the
numbering will become clearer later):

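Theorem 1 (Properties of $M = 2^{-KM}$) Solomonoff's prior M defined in (5) is a
(i) universal, (v) enumerable, (ii) monotone, (iii) semimeasure, which (vi)
converges to $\mu$ i.m.s., and (vii) is self-optimizing i.m.s. More quantitatively:

(vi)  $\sum_{t=1}^\infty E\big[\sum_{x'_t}(M(x'_t|x_{<t}) - \mu(x'_t|x_{<t}))^2\big] \stackrel{+}{\le} \ln 2\cdot K(\mu)$, which implies
      $M(x'_t|x_{<t}) \xrightarrow{t\to\infty} \mu(x'_t|x_{<t})$ i.m.s. for $\mu\in\mathcal{M}^{msr}_{comp}$.

(vii) $\sum_{t=1}^\infty E\big[(l_t^{\Lambda_M} - l_t^{\Lambda_\mu})^2\big] \stackrel{+}{\le} 2\ln 2\cdot K(\mu)$, which implies
      $l_t^{\Lambda_M} \xrightarrow{t\to\infty} l_t^{\Lambda_\mu}$ i.m.s. for $\mu\in\mathcal{M}^{msr}_{comp}$,

where $K(\mu)$ is the length of the shortest program computing function $\mu$.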



4 Alternatives to Solomonoff's Prior M 

The goal of this work is to investigate whether some other quantities which are
closely related to M also lead to good predictors. The prefix Kolmogorov
complexity K is closely related to KM ($K(x) = KM(x) + O(\log l(x))$). $K(x)$ is
defined as the length of the shortest halting program on U with output x:

$K(x) := \min\{l(p) : U(p) = x \text{ halts}\}, \qquad k(x) := 2^{-K(x)}.$    (6)

In a later section we briefly discuss that K completely fails for predictive
purposes. More promising is to approximate
$M(x) = \sum_{p\,:\,U(p)=x*} 2^{-l(p)}$ by the dominant contribution in the sum,
which is given by

$m(x) := 2^{-Km(x)}$  with  $Km(x) := \min_p\{l(p) : U(p) = x*\}.$    (7)

Km is called monotone complexity and has been shown to be very close to KM
[Lev73, Gac83] (see also Theorem 5(o)). It is natural to call a sequence
$x_{1:\infty}$ computable if $Km(x_{1:\infty}) < \infty$. KM, Km, and K are ordered
in the following way:

$0 \le K(x|l(x)) \stackrel{+}{\le} KM(x) \le Km(x) \le K(x) \stackrel{+}{\le} l(x) + 2\log l(x).$    (8)
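
The relation between the sum in (5) and its dominant term in (7) can be made concrete on a toy machine. The sketch below is an illustration only: `TOY_MACHINE` is a hypothetical, non-universal machine given as a prefix-free table of programs with finite output prefixes, so all listed programs are minimal and only strings of length at most 8 are meaningful. The names `M_toy`, `Km_toy`, and `m_toy` are assumptions introduced here.

```python
from fractions import Fraction

# Hypothetical toy monotone machine T: a prefix-free set of programs, each
# mapped to the (finite prefix of the) string it outputs. T is NOT universal.
TOY_MACHINE = {
    "0": "00000000",
    "10": "01010101",
    "110": "01100110",
    "111": "11111111",
}


def M_toy(x):
    """Sum of 2^(-l(p)) over programs p whose output starts with x, cf. (5)."""
    return sum((Fraction(1, 2 ** len(p))
                for p, out in TOY_MACHINE.items() if out.startswith(x)),
               Fraction(0))


def Km_toy(x):
    """Length of the shortest program whose output starts with x, cf. (7)."""
    lengths = [len(p) for p, out in TOY_MACHINE.items() if out.startswith(x)]
    return min(lengths) if lengths else float("inf")


def m_toy(x):
    """m(x) := 2^(-Km(x)); zero if no program of T outputs x*."""
    k = Km_toy(x)
    return Fraction(0) if k == float("inf") else Fraction(1, 2 ** k)


for x in ["0", "00", "01"]:
    print(x, M_toy(x), m_toy(x))
```

On this toy machine the sum dominates its largest term for every string, and the two extensions of "0" already show that the dominant-term function can assign more total weight to the extensions than to the string itself, whereas the sum cannot.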



Predictions based on Kolmogorov Complexity 






There are many complexity measures (prefix, Solomonoff, monotone, plain, process,
extension, ...) which we generically denote by $\tilde K\in\{K, KM, Km, ...\}$ and
their associated "predictive functions"
$\tilde k(x) := 2^{-\tilde K(x)} \in \{k, M, m, ...\}$. This work is mainly devoted
to the study of m.

Note that $\tilde k$ is generally not a semimeasure, so we have to clarify what it
means to predict using $\tilde k$. One popular approach which is at the heart of
the (one-part) MDL principle is to predict the y which minimizes $\tilde K(xy)$
(maximizes $\tilde k(xy)$), where x are past given data:
$y_t^{MDL} := \arg\min_{y_t} \tilde K(x_{<t}y_t)$.

For complexity measures $\tilde K$, the conditional version $\tilde K_|(x|y)$ is
often defined$^5$ as $\tilde K(x)$, but where the underlying Turing machine U has
additionally access to y. The definition
$\tilde k_|(x|y) := 2^{-\tilde K_|(x|y)}$ for the conditional predictive function
$\tilde k$ seems natural, but has the disadvantage that the crucial chain rule (1)
is violated. For $\tilde K = K$ and $\tilde K = Km$ and most other versions of
$\tilde K$, the chain rule is still satisfied approximately (to logarithmic
accuracy), but this is not sufficient to prove convergence (2) or
self-optimizingness (4). Therefore, we define
$\tilde k(x_t|x_{<t}) := \tilde k(x_{1:t})/\tilde k(x_{<t})$ in the following,
analogously to semimeasures $\rho$ (like M). A potential disadvantage of this
definition is that $\tilde k(x_t|x_{<t})$ is not enumerable, whereas
$\tilde k_|(x_t|x_{<t})$ and $\tilde k(x_{1:t})$ are.

We can now embed MDL predictions minimizing $\tilde K$ into our general framework:
MDL coincides with the $\Lambda_{\tilde k}$ predictor for the error-loss:

$y_t^{\Lambda_{\tilde k}} = \arg\max_{y_t} \tilde k(y_t|x_{<t}) = \arg\max_{y_t} \tilde k(x_{<t}y_t) = \arg\min_{y_t} \tilde K(x_{<t}y_t) = y_t^{MDL}$    (9)

In the first equality we inserted $\ell_{xy} = 1 - \delta_{xy}$ into (3). In the
second equality we used the chain rule (1). In both steps we dropped some additive
or multiplicative terms that are independent of $y_t$ and hence have no effect on
the argmax. In the third equality we used $\tilde k = 2^{-\tilde K}$. The last
equality formalizes the one-part MDL principle: given $x_{<t}$ predict the
$y_t\in\mathcal{X}$ which leads to the shortest code p. Hence, validity of (4)
tells us something about the validity of the MDL principle. (2) and (4) address
what (good) prediction means.
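
The chain of equalities in (9) can be exercised with any computable stand-in for a complexity measure. The sketch below uses the zlib-compressed length as such a stand-in; this is purely an assumption made to obtain runnable code, not Kolmogorov complexity (which is not computable) and not a method proposed in this paper. The names `K_proxy`, `mdl_predict`, and `k_predict` are hypothetical.

```python
import zlib
from fractions import Fraction


def K_proxy(x):
    """Crude computable stand-in for a complexity of x: zlib output length in bits."""
    return 8 * len(zlib.compress(x.encode("ascii"), 9))


def mdl_predict(past, alphabet="01"):
    """argmin_y K_proxy(x_<t y); ties go to the first symbol of `alphabet`."""
    return min(alphabet, key=lambda y: K_proxy(past + y))


def k_predict(past, alphabet="01"):
    """argmax_y 2^(-K_proxy(x_<t y)), the error-loss predictor in (9)."""
    return max(alphabet, key=lambda y: Fraction(1, 2 ** K_proxy(past + y)))


past = "01" * 40
print(mdl_predict(past), k_predict(past))
```

Because $2^{-K}$ is strictly decreasing in K, both functions always return the same symbol, which is the content of the third equality in (9). Exact fractions are used so that very small values of $2^{-K}$ do not underflow to zero.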

5 General Predictive Functions 

We have seen that there are predictors (actually the major one studied in this
work) $\Lambda_\rho$, but where $\rho(x_t|x_{<t})$ is not (immediately) a
semimeasure. Nothing prevents us from replacing $\rho$ in (3) by an arbitrary
function $b_| : \mathcal{X}^* \to [0,\infty)$, written as $b_|(x_t|x_{<t})$. We
also define general functions $b : \mathcal{X}^* \to [0,\infty)$, written as
$b(x_{1:n})$, and $b(x_t|x_{<t}) := b(x_{1:t})/b(x_{<t})$, which may not coincide
with $b_|(x_t|x_{<t})$. Most terminology for semimeasure $\rho$ can and will be
carried over to the case of general predictive functions b and $b_|$, but one has
to be careful which properties and interpretations still hold:

Definition 2 (Properties of predictive functions) We call functions
$b, b_| : \mathcal{X}^* \to [0,\infty)$ (conditional) predictive functions. They
may possess some of the following properties:



5 Usually written without index |. 
o) Proximity: $b(x)$ is "close" to the universal prior $M(x)$
i) Universality: $b \stackrel{\times}{\ge} \mathcal{M}$, i.e. $\forall\nu\in\mathcal{M}\ \exists c>0 : b(x) \ge c\cdot\nu(x)\ \forall x$.
ii) Monotonicity: $b(x_{1:t}) \le b(x_{<t})\ \forall t, x_{1:t}$
iii) Semimeasure: $\sum_{x_t} b(x_{1:t}) \le b(x_{<t})$ and $b(\epsilon) \le 1$
iv) Chain rule: $b(x_{1:t}) = b_\cdot(x_t|x_{<t})\,b(x_{<t})$
v) Enumerability: b is lower semi-computable
vi) Convergence: $b_\cdot(x'_t|x_{<t}) \xrightarrow{t\to\infty} \mu(x'_t|x_{<t})\ \forall\mu\in\mathcal{M}, x'_t\in\mathcal{X}$ i.m.s. or w.p.1
vii) Self-optimizingness: $l_t^{\Lambda_{b_\cdot}} \xrightarrow{t\to\infty} l_t^{\Lambda_\mu}$ i.m.s. or w.p.1
where $b_\cdot$ refers to b or $b_|$.

The importance of the properties (i)-(iv) stems from the fact that they together
imply convergence (vi) and self-optimizingness (vii). Regarding proximity (o) we
left open what we mean by "close". We also did not specify $\mathcal{M}$, but have
in mind all computable measures $\mathcal{M}^{msr}_{comp}$ or enumerable
semimeasures $\mathcal{M}^{semi}_{enum}$, possibly restricted to deterministic
environments $\mathcal{M}_{det}$.

Theorem 3 (Predictive relations)

a) (iii) $\Rightarrow$ (ii): A semimeasure is monotone.

b) (i),(iii),(iv) $\Rightarrow$ (vi): The posterior $b_\cdot$ as defined by the
chain rule (iv) of a universal semimeasure b converges to $\mu$ i.m.s. for all
$\mu\in\mathcal{M}$.

c) (i),(iii),(v) $\Rightarrow$ (o): Every w.r.t. $\mathcal{M}^{semi}_{enum}$
universal enumerable semimeasure coincides with M within a multiplicative constant.

d) (vi) $\Rightarrow$ (vii): Posterior convergence i.m.s./w.p.1 implies
self-optimizingness i.m.s./w.p.1.

Proof sketch. (a) follows trivially from dropping the sum in (iii). (b) and (c)
are Solomonoff's major results [Sol78, LV97, Hut01a]. (d) follows from
$0 \le l_t^{\Lambda_{b_\cdot}} - l_t^{\Lambda_\mu} \le \sum_{x'_t} |b_\cdot(x'_t|x_{<t}) - \mu(x'_t|x_{<t})|$,
since $\ell\in[0,1]$ [Hut02, Th.4(ii)]. □

We will see that (i),(iii),(iv) are crucial for proving (vi),(vii).

Normalization. Let us consider a scaled b version
$b_{norm}(x_t|x_{<t}) := c(x_{<t})\,b(x_t|x_{<t})$, where $c > 0$ is independent of
$x_t$. Such a scaling does not affect the prediction scheme $\Lambda_b$ (3), i.e.
$y_t^{\Lambda_b} = y_t^{\Lambda_{b_{norm}}}$, which implies
$l_t^{\Lambda_{b_{norm}}} = l_t^{\Lambda_b}$. Convergence
$b(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$ implies $\sum_{x'_t} b(x'_t|x_{<t}) \to 1$ if
$\mu$ is a measure, hence also $b_{norm}(x'_t|x_{<t}) \to \mu(x'_t|x_{<t})$
for$^6$ $c(x_{<t}) := [\sum_{x'_t} b(x'_t|x_{<t})]^{-1}$. Speed of convergence may
be affected by normalization, either positively or negatively. Assuming the chain
rule (1) for $b_{norm}$ we get

$b_{norm}(x_{1:n}) = \prod_{t=1}^n \frac{b(x_{1:t})}{\sum_{x_t} b(x_{1:t})} = d(x_{<n})\,b(x_{1:n}), \qquad d(x_{<n}) := \frac{1}{b(\epsilon)} \prod_{t=1}^n \frac{b(x_{<t})}{\sum_{x_t} b(x_{1:t})}.$

6 Arbitrarily we define $b_{norm}(x_t|x_{<t}) = \frac{1}{|\mathcal{X}|}$ if $\sum_{x'_t} b(x'_t|x_{<t}) = 0$.

Whatever b we start with, $b_{norm}$ is a measure, i.e. (iii) is satisfied with
equality. Convergence and self-optimizingness proofs are now eligible for
$b_{norm}$, provided universality (i) can be proven for $b_{norm}$. If b is a
semimeasure, then $d \ge 1$, hence
$M_{norm} \ge M \stackrel{\times}{\ge} \mathcal{M}^{semi}_{enum}$ is universal and
converges (vi) with same bound (Theorem 1(vi)) as for M. On the other hand
$d(x_{<n})$ may be unbounded for b = k and b = m, so normalization does not help us
in these cases for proving (vi). Normalization transforms a universal
non-semimeasure into a measure, which may no longer be universal.
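
That rescaling the conditional values by a factor independent of $x_t$ leaves the prediction unchanged can be seen on a one-step example. This is a sketch under stated assumptions: the conditional values in `b_cond` are made up, the uniform fallback for a zero sum is the arbitrary convention assumed here, and the names `normalize` and `predict` are hypothetical.

```python
from fractions import Fraction


def normalize(b_cond):
    """b_norm(x|.) := b(x|.) / sum_x' b(x'|.); uniform if the sum is zero (assumed convention)."""
    total = sum(b_cond.values())
    if total == 0:
        return {x: Fraction(1, len(b_cond)) for x in b_cond}
    return {x: v / total for x, v in b_cond.items()}


def predict(b_cond, loss, actions):
    """argmin_y sum_x b(x|.) * loss[x][y], cf. (3)."""
    return min(actions, key=lambda y: sum(v * loss[x][y] for x, v in b_cond.items()))


# Hypothetical conditional values of a non-semimeasure: they sum to 3/2 > 1.
b_cond = {0: Fraction(1), 1: Fraction(1, 2)}
loss = {0: {0: 0, 1: 1}, 1: {0: 1, 1: 0}}  # error loss

b_norm = normalize(b_cond)
print(sum(b_norm.values()), predict(b_cond, loss, [0, 1]), predict(b_norm, loss, [0, 1]))
```

Normalization restores a proper conditional distribution at each step, but as noted above it says nothing about whether the resulting measure is still universal.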



6 Predictive Properties of $m = 2^{-Km}$


We can now state which predictive properties of m hold, and which do not. In order
not to overload the reader, we first summarize the qualitative predictive
properties of m in Corollary 4 and subsequently present detailed quantitative
results in Theorem 5, followed by an item-by-item explanation and discussion. The
proofs are deferred to the next section.

Corollary 4 (Properties of $m = 2^{-Km}$) For $b = m = 2^{-Km}$, where Km is the
monotone Kolmogorov complexity, the following properties of Definition 2 are
satisfied/violated: (o) For every $\mu\in\mathcal{M}^{msr}_{comp}$ and every
$\mu$-random sequence $x_{1:\infty}$, $m(x_{1:n})$ equals $M(x_{1:n})$ within a
multiplicative constant. m is (i) universal (w.r.t.
$\mathcal{M} = \mathcal{M}^{msr}_{comp}$), (ii) monotone, and (v) enumerable, but
is $\neg$(iii) not a semimeasure. m satisfies (iv) the chain rule by definition for
$m_\cdot = m$, but for $m_\cdot = m_|$ the chain rule is only satisfied to
logarithmic order. For $m_\cdot = m$, m (vi) converges and (vii) is self-optimizing
for deterministic $\mu\in\mathcal{M}^{msr}_{comp}\cap\mathcal{M}_{det}$, but in
general not for probabilistic
$\mu\in\mathcal{M}^{msr}_{comp}\setminus\mathcal{M}_{det}$.

The lesson to learn is that although m is very close to M in the sense of (o) and m
dominates all computable measures $\mu$, predictions based on m may nevertheless
fail (cf. Theorem 1).

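Theorem 5 (Detailed properties of $m = 2^{-Km}$) For $b = m = 2^{-Km}$, where
$Km(x) := \min_p\{l(p) : U(p) = x*\}$ is the monotone Kolmogorov complexity, the
following properties of Definition 2 are satisfied/violated:

(o)        $\forall\mu\in\mathcal{M}^{msr}_{comp}\ \forall\mu$-random $\omega\ \exists c_\omega : Km(\omega_{1:n}) \le KM(\omega_{1:n}) + c_\omega\ \forall n$, [Lev73]
           $KM(x) \le Km(x) \le KM(x) + 2\log Km(x) + O(1)\ \forall x$. [ZL70, Th.3.4]

$\neg$(o)  $\forall c : Km(x) - KM(x) \ge c$ for infinitely many x.

(i)        $Km(x) \stackrel{+}{\le} -\log\mu(x) + K(\mu)$ if $\mu\in\mathcal{M}^{msr}_{comp}$, [LV97, Th.4.5.4]
           $m \stackrel{\times}{\ge} \mathcal{M}^{msr}_{comp}$, but $m \stackrel{\times}{\not\ge} \mathcal{M}^{semi}_{enum}$ (unlike $M \stackrel{\times}{\ge} \mathcal{M}^{semi}_{enum}$).

(ii)       $Km(xy) \ge Km(x) \in \mathbb{N}_0$, $\quad 0 < m(xy) \le m(x) \in 2^{-\mathbb{N}_0} \le 1$.

$\neg$(iii) If $x_{1:\infty}$ is computable, then $\sum_{x_t} m(x_{1:t}) \not\le m(x_{<t})$ for almost all t,
           If $Km(x_{1:t}) = o(t)$, then $\sum_{x_t} m(x_{1:t}) \not\le m(x_{<t})$ for most t.

(iv)       $0 < m(x|y) := \frac{m(yx)}{m(y)} \le 1$.

$\neg$(iv) If $m_|(x|y) := 2^{-\min_p\{l(p)\,:\,U(p,y) = x*\}}$, then $\exists x,y : m(yx) \neq m_|(x|y)\cdot m(y)$,
           $Km(yx) = Km_|(x|y) + Km(y) \pm O(\log l(xy))$.

(v)        m is enumerable, i.e. lower semi-computable.

(vi)       $\sum_{t=1}^n |1 - m(x_t|x_{<t})| \le \frac{1}{2} Km(x_{1:n})$, $\quad m(x_t|x_{<t}) \stackrel{fast}{\longrightarrow} 1$ for comp. $x_{1:\infty}$,
           Indeed, $m(x_t|x_{<t}) \neq 1$ at most $Km(x_{1:\infty})$ times,
           $\sum_{t=1}^n m(\bar x_t|x_{<t}) \le 2^{Km(x_{1:n})}$, $\quad m(\bar x_t|x_{<t}) \stackrel{slow?}{\longrightarrow} 0$ for computable $x_{1:\infty}$.

$\neg$(vi) $\exists\mu\in\mathcal{M}^{msr}_{comp}\setminus\mathcal{M}_{det} : m_{(norm)}(x_t|x_{<t}) \not\to \mu(x_t|x_{<t})\ \forall x_{1:\infty}$

(vii)      $l_t^{\Lambda_m}(x_{<t}) \longrightarrow l_t^{\Lambda_\omega} = \min_{y_t} \ell_{x_ty_t}$ if $\omega \equiv x_{1:\infty}$ is computable.
           $\Lambda_m = \Lambda_{m_{norm}}$, i.e. $y_t^{\Lambda_m} = y_t^{\Lambda_{m_{norm}}}$ and $l_t^{\Lambda_m} = l_t^{\Lambda_{m_{norm}}}$.

$\neg$(vii) $\forall|\mathcal{Y}| > 2\ \exists\ell,\mu : l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1\ \forall t$ ($c = \frac{6}{5} - \varepsilon$ possible).
           $\forall$ non-degenerate $\ell\ \exists U,\mu : l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \not\to 1$ with high probability.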



Explanation and discussion. (o) The first line shows that m is close to M within
a multiplicative constant for nearly all strings in a very strong sense.
$\sup_n \frac{M(\omega_{1:n})}{m(\omega_{1:n})} \le 2^{c_\omega}$ is finite for
every $\omega$ which is random (in the sense of Martin-Löf) w.r.t. any computable
$\mu$, but note that the constant $c_\omega$ depends on $\omega$. Levin falsely
conjectured the result to be true for all $\omega$, but could only prove it to
hold within logarithmic accuracy (second line).

$\neg$(o) A later result by Gács, indeed, implies that Km - KM is unbounded (for
infinite alphabet it can even increase logarithmically).

(i) The first line can be interpreted as a "continuous" coding theorem for Km
and recursive $\mu$. It implies (by exponentiation) that m dominates all computable
measures (second line). Unlike M it does not dominate all enumerable
semimeasures. Dominance is a key feature for good predictors. From a practical
point of view the assumption that the true generating distribution $\mu$ is a
proper measure and computable seems not to be restrictive. The problem will be that
m is not a semimeasure.

(ii) The monotonicity property is obvious from the definition of Km and is the
origin of calling Km monotone complexity.

$\neg$(iii) shows and quantifies how the crucial semimeasure property is violated
for m in an essential way, where almost all t means "all but finitely many," and
most t means "all but an asymptotically vanishing fraction."

(iv) The chain rule can be satisfied by definition. With such a definition,
$m(x|y)$ is strictly positive like $M(x|y)$, but not necessarily strictly less than
1, unlike $M(x|y)$. Nevertheless it is bounded by 1 due to monotonicity of m,
unlike for k.

$\neg$(iv) If a conditional monotone complexity $Km_| = -\log m_|$ is defined
similarly to the conditional Kolmogorov complexity $K_|$, then the chain rule is
only valid within logarithmic accuracy.

(v) m shares the obvious enumerability property with M.

(vi) (first line) shows that the on-sequence predictive properties of m for
deterministic computable environments are excellent. The predicted
m-probability$^7$ of $x_t$ given $x_{<t}$ converges rapidly to 1 for reasonably
simple/complex $x_{1:\infty}$. A similar result holds for M. The stronger result
(second line), that $m(x_t|x_{<t})$ deviates from 1 at most $Km(x_{1:\infty})$
times, does not hold for M. Note that perfect on-sequence prediction could
trivially be achieved by always predicting 1 ($b_\cdot \equiv 1$). Since we do not
know the true outcome $x_t$ in advance, we need to predict $m(x'_t|x_{<t})$ well
for all $x'_t\in\mathcal{X}$. $m(\cdot|\cdot)$ also converges off-sequence for
$\bar x_t \neq x_t$ (to zero as it should be), but the bound (third line) is much
weaker than the on-sequence bound (first line), so rapid convergence cannot be
concluded, unlike for M, where
$M(x_t|x_{<t}) \stackrel{fast}{\longrightarrow} 1$ implies
$M(\bar x_t|x_{<t}) \stackrel{fast}{\longrightarrow} 0$, since
$\sum_{x'_t} M(x'_t|x_{<t}) \le 1$. Consider an environment $x_{1:\infty}$
describable in 500 bits; then bound (vi) line 3 does not exclude
$m(\bar x_t|x_{<t})$ from being 1 (maximally wrong) for all $t = 1..2^{500}$, with
asymptotic convergence being of pure academic interest.

$\neg$(vi) The situation is provably worse in the probabilistic case. There are
computable measures $\mu$ for which neither $m(x_t|x_{<t})$ nor
$m_{norm}(x_t|x_{<t})$ converge to $\mu(x_t|x_{<t})$ for any $x_{1:\infty}$.

(vii) Since (vi) implies (vii) by continuity, we have convergence of the
instantaneous losses for computable environments $x_{1:\infty}$, but since we do
not know the speed of convergence off-sequence, we do not know how fast the losses
converge to optimum.

$\neg$(vii) Non-convergence $\neg$(vi) does not necessarily imply that $\Lambda_m$
is not self-optimizing, since different predictive functions can lead to the same
predictor $\Lambda$. But it turns out that $\Lambda_m$ is not self-optimizing even
in Bernoulli environments $\mu$ for particular losses $\ell$ (first line). A
similar result holds for any non-degenerate loss function (especially for the
error-loss, cf. (9)), for specific choices of the universal Turing machine U
(second line). Loss $\ell$ is defined to be non-degenerate iff
$\bigcap_{x\in\mathcal{X}} \{\tilde y : \ell_{x\tilde y} = \min_y \ell_{xy}\} = \{\}$.
Assume the contrary, that a single action $\tilde y$ is optimal for every outcome
x, i.e. that ($\arg\min_y$ can be chosen such that)
$\arg\min_y \ell_{xy} = \tilde y\ \forall x$. This implies
$y_t^{\Lambda_\rho} = \tilde y\ \forall\rho$, which implies
$l_t^{\Lambda_m}/l_t^{\Lambda_\mu} \equiv 1$. So the non-degeneracy assumption is
necessary (and sufficient).
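
The first line of $\neg$(vii) can be checked numerically for a three-action loss of the kind used in the proof in Section 7: $\ell_{x0} = x$, $\ell_{x1} = 3/8$, $\ell_{x2} = \frac{2}{3}(1-x)$, in a Bernoulli environment with $\mu(1|x_{<t}) = 2/5$. The sketch below is only an illustration under these stated values. It relies on $\Lambda_m = \Lambda_{m_{norm}}$ and on the range of $m_{norm}$ for binary alphabet being contained in $\{1/(1+2^z) : z\in\mathbb{Z}\}$, truncated here to $|z| \le 30$; the identifiers are hypothetical.

```python
from fractions import Fraction as F

# Loss l_xy for outcomes x in {0,1} and actions y in {0,1,2}.
LOSS = {0: {0: F(0), 1: F(3, 8), 2: F(2, 3)},
        1: {0: F(1), 1: F(3, 8), 2: F(0)}}
ACTIONS = [0, 1, 2]
MU1 = F(2, 5)  # true probability mu(1|x_<t) of the Bernoulli environment


def exp_loss(p1, y):
    """Expected loss of action y when the next symbol is 1 with probability p1."""
    return (1 - p1) * LOSS[0][y] + p1 * LOSS[1][y]


def best_action(p1):
    """Action minimizing the expected loss under the belief p1, cf. (3)."""
    return min(ACTIONS, key=lambda y: exp_loss(p1, y))


# Possible values of m_norm(1|x_<t): 1/(1+2^z) for integer z (truncated range).
m_norm_values = [1 / (1 + F(2) ** z) for z in range(-30, 31)]

optimal = exp_loss(MU1, best_action(MU1))
# True (mu-expected) loss of the action chosen under each possible belief,
# relative to the optimal loss.
ratios = {exp_loss(MU1, best_action(r)) / optimal for r in m_norm_values}
print(best_action(MU1), optimal, ratios)
```

Every admissible belief value leads to action 0 or 2, never to the optimal action 1, so the loss ratio is the same constant above 1 at every time step.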



7 We say "probability" just for convenience, not forgetting that $m(\cdot|x_{<t})$ is not a proper
(semi)probability distribution.
7 Proof of Theorem 5

(o) The first two properties are due to Levin and are proven in [Lev73] and
[ZL70, Th.3.4], respectively. The third property is an easy corollary from Gács'
result [Gac83], which says that if g is some monotone co-enumerable function for
which $Km(x) - KM(x) \le g(l(x))$ holds for all x, then $g(n)$ must be
$\stackrel{+}{\ge} K(n)$. Assume $Km(x) - KM(x) \ge \log l(x)$ only for finitely
many x. Then there exists a c such that $Km(x) - KM(x) \le \log l(x) + c$ for all
x. Gács' theorem now implies $\log n + c \stackrel{+}{\ge} K(n)\ \forall n$, which
is wrong due to Kraft's inequality $\sum_n 2^{-K(n)} \le 1$.

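(i) The first line is proven in [LV97, Th.4.5.4]. Exponentiating this result gives
$m(x) \ge c_\mu\,\mu(x)\ \forall x, \mu\in\mathcal{M}^{msr}_{comp}$, i.e.
$m \stackrel{\times}{\ge} \mathcal{M}^{msr}_{comp}$. Exponentiation of $\neg$(o)
gives $m(x) \stackrel{\times}{\le} M(x)/l(x)$ for infinitely many x, which implies
$m(x) \stackrel{\times}{\not\ge} M(x) \in \mathcal{M}^{semi}_{enum}$, i.e.
$m \stackrel{\times}{\not\ge} \mathcal{M}^{semi}_{enum}$.

(ii) is obvious from the definition of Km and m.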

$\neg$(iii) Simple violation of the semimeasure property can be inferred
indirectly from (i),(iv),$\neg$(vi) and Theorem 3(b). To prove $\neg$(iii) we first
note that $Km(x) < \infty$ for all finite strings $x\in\mathcal{X}^*$, which
implies $m(x_{1:n}) > 0$. Hence, whenever $Km(x_{1:n}) = Km(x_{<n})$, we have
$\sum_{x_n} m(x_{1:n}) > m(x_{1:n}) = m(x_{<n})$, a violation of the semimeasure
property. $\neg$(iii) now follows from
$\#\{t \le n : \sum_{x_t} m(x_{1:t}) \le m(x_{<t})\} \le \#\{t \le n : Km(x_{1:t}) \neq Km(x_{<t})\} \le \sum_{t=1}^n [Km(x_{1:t}) - Km(x_{<t})] = Km(x_{1:n})$,
where we exploited (ii) in the last inequality.

(iv) immediate from (ii).

$\neg$(iv) (first line) follows from the fact that equality does not even hold
within an additive constant, i.e.
$Km(yx) \stackrel{+}{\neq} Km_|(x|y) + Km(y)$. The proof of the latter is similar
to the one for K (see [LV97]). $\neg$(iv) (second line) follows within log from
$Km = K + O(\log)$ and $K(yx) = K(x|y) + K(y) + O(\log)$ [LV97].

(v) immediate from definition.

(vi) $\#\{t \le n : m(x_t|x_{<t}) \neq 1\} \le \sum_{t=1}^n 2|1 - m(x_t|x_{<t})| \le -\sum_{t=1}^n \log m(x_t|x_{<t}) = -\log m(x_{1:n}) = Km(x_{1:n}) < \infty$.
In the first inequality we used $m := m(x_t|x_{<t}) \in 2^{-\mathbb{N}_0}$, hence
$1 \le 2|1 - m|$ for $m \neq 1$. In the second inequality we used
$1 - m \le -\frac{1}{2}\log m$, valid for $m\in[0,\frac{1}{2}]\cup\{1\}$. In the
first equality we used (the log of) the chain rule n times. For computable
$x_{1:\infty}$ we have
$\sum_{t=1}^\infty |1 - m(x_t|x_{<t})| \le \frac{1}{2} Km(x_{1:\infty}) < \infty$,
which implies $m(x_t|x_{<t}) \to 1$ (fast if $Km(x_{1:\infty})$ is of reasonable
size). This shows the first two lines of (vi). The last line is shown as follows:
Fix a sequence $x_{1:\infty}$ and define
$\mathcal{Q} := \{x_{<t}\bar x_t : t\in\mathbb{N}, \bar x_t \neq x_t\}$.
$\mathcal{Q}$ is a prefix-free set of finite strings. For any such $\mathcal{Q}$
and any semimeasure $\mu$, one can show that
$\sum_{x\in\mathcal{Q}} \mu(x) \le 1$.$^8$ Since M is a semimeasure lower bounded
by m we get

$\sum_{t=1}^n m(x_{<t}\bar x_t) \le \sum_{t=1}^\infty m(x_{<t}\bar x_t) = \sum_{x\in\mathcal{Q}} m(x) \le \sum_{x\in\mathcal{Q}} M(x) \le 1.$

With this, and using monotonicity of m, we get

$\sum_{t=1}^n m(\bar x_t|x_{<t}) = \sum_{t=1}^n \frac{m(x_{<t}\bar x_t)}{m(x_{<t})} \le \sum_{t=1}^n \frac{m(x_{<t}\bar x_t)}{m(x_{1:n})} \le \frac{1}{m(x_{1:n})} = 2^{Km(x_{1:n})}.$

Finally, for an infinite sum to be finite, its elements must converge to zero.

8 This follows from $1 \ge \mu(A\cup B) \ge \mu(A) + \mu(B)$ if $A\cap B = \{\}$, and $\Gamma_x\cap\Gamma_y = \{\}$ if x is not a prefix of y and
y is not a prefix of x, where $\Gamma_x := \{\omega : \omega_{1:l(x)} = x\}$, hence $\sum_{x\in\mathcal{Q}} \mu(\Gamma_x) \le \mu(\bigcup_{x\in\mathcal{Q}} \Gamma_x) \le 1$, and noting that
$\mu(x)$ is actually an abbreviation for $\mu(\Gamma_x)$.
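
The two elementary inequalities used for the first lines of (vi), $1 \le 2|1-m|$ for $m \neq 1$ and $1 - m \le -\frac{1}{2}\log m$ on the range $2^{-\mathbb{N}_0}$, can be checked numerically, together with the resulting chain of bounds on a short example. This is a sketch: the range is truncated to 60 values, and the list `conditionals` is a made-up sequence of on-sequence values $m(x_t|x_{<t})$.

```python
import math

# Candidate values of m(x_t|x_<t): the range 2^(-N_0) = {1, 1/2, 1/4, ...},
# truncated to 60 elements (powers of two are exact in floating point).
m_values = [2.0 ** (-k) for k in range(60)]

first_ok = all(1 <= 2 * abs(1 - m) for m in m_values if m != 1)
second_ok = all(1 - m <= -0.5 * math.log2(m) for m in m_values)

# Hypothetical on-sequence conditionals m(x_t|x_<t) for t = 1..6.
conditionals = [0.5, 1.0, 0.25, 1.0, 1.0, 0.5]
deviations = sum(1 for m in conditionals if m != 1)    # #{t : m(x_t|x_<t) != 1}
twice_abs = sum(2 * abs(1 - m) for m in conditionals)  # sum_t 2|1 - m(x_t|x_<t)|
km = -sum(math.log2(m) for m in conditionals)          # Km(x_1:n) via the chain rule

print(first_ok, second_ok, deviations, twice_abs, km)
```

The three printed numbers are nondecreasing from left to right, in line with the chain of inequalities at the start of the proof of (vi).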

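¬(vi) follows from the non-denseness of the range of $m_{(norm)}$: We choose $\mu(1|x_{<t}) = \frac38$, hence $\mu(0|x_{<t}) = \frac58$. Since $m(x_t|x_{<t}) \in 2^{-\mathbb{N}_0} = \{1,\frac12,\frac14,\frac18,...\}$, we have $|m(x_t|x_{<t}) - \mu(x_t|x_{<t})| \ge \frac18$ $\forall t$, $\forall x_{1:\infty}$. Similarly for

$$m_{norm}(x_t|x_{<t}) = \frac{m(x_t|x_{<t})}{m(0|x_{<t})+m(1|x_{<t})} \in \Big\{\frac{2^{-n}}{2^{-n}+2^{-m}} : n,m\in\mathbb{N}_0\Big\} = \Big\{\frac{1}{1+2^z} : z\in\mathbb{Z}\Big\} = \Big\{...,\frac19,\frac15,\frac13,\frac12,\frac23,\frac45,\frac89,...\Big\}$$

we choose $\mu(1|x_{<t}) = 1-\mu(0|x_{<t}) = \frac{5}{12}$, which implies $|m_{norm}(x_t|x_{<t}) - \mu(x_t|x_{<t})| \ge \frac{1}{12}$ $\forall t$, $\forall x_{1:\infty}$.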
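The gap argument of ¬(vi) can be reproduced with exact rational arithmetic. This is a minimal sketch: the helper names are hypothetical, and the ranges are truncated at a finite exponent, which does not change the minimum because the omitted values only accumulate near 0 and 1.

```python
from fractions import Fraction

def dyadic_range(max_exp):
    """The values 2^-k for k = 0..max_exp (truncated range of m)."""
    return [Fraction(1, 2 ** k) for k in range(max_exp + 1)]

def norm_range(max_exp):
    """The values 2^-n / (2^-n + 2^-m) for 0 <= n, m <= max_exp (truncated range of m_norm)."""
    vals = set()
    for n in range(max_exp + 1):
        for m in range(max_exp + 1):
            a, b = Fraction(1, 2 ** n), Fraction(1, 2 ** m)
            vals.add(a / (a + b))
    return sorted(vals)

def min_distance(target, values):
    """Smallest absolute distance from target to any element of values."""
    return min(abs(target - v) for v in values)

# mu(1|x_<t) = 3/8 and mu(0|x_<t) = 5/8 against the range of m
gap_m = min(min_distance(Fraction(3, 8), dyadic_range(20)),
            min_distance(Fraction(5, 8), dyadic_range(20)))
# mu(1|x_<t) = 5/12 and mu(0|x_<t) = 7/12 against the range of m_norm
gap_norm = min(min_distance(Fraction(5, 12), norm_range(10)),
               min_distance(Fraction(7, 12), norm_range(10)))
```

Here `gap_m` evaluates to 1/8 and `gap_norm` to 1/12, matching the two lower bounds stated above.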

(vii) The first line follows from (vi) and Theorem Ell. That normalization does not
affect the predictor follows from the definition of $y_t^{\Lambda_\rho}$ (HU) and the fact that $\arg\min()$
is not affected by scaling its argument.

¬(vii) Non-convergence of $m$ does not necessarily imply non-convergence of the
losses. For instance, for $\mathcal{Y} = \{0,1\}$, and $\omega'_t := 1/0$ for $\mu(1|x_{<t}) \gtrless \gamma := \frac{\ell_{01}-\ell_{00}}{\ell_{01}-\ell_{00}+\ell_{10}-\ell_{11}}$,
one can show that $y_t^{\Lambda_\mu} = y_t^{\Lambda_{\omega'}}$, hence convergence of $m(x_t|x_{<t})$ to 0/1 and not to
$\mu(x_t|x_{<t})$ could nevertheless lead to correct predictions.
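The binary-action remark can be illustrated by comparing the Bayes-optimal action under $\mu$ with the action obtained after rounding $\mu(1|x_{<t})$ to 0 or 1 at the threshold $\gamma$. The sketch below assumes a loss matrix with $\ell_{01}>\ell_{00}$ and $\ell_{10}>\ell_{11}$ (each outcome has its own best action); the function names, the example loss values and the tie-breaking rule (smaller action wins) are illustrative choices, not taken from the text.

```python
from fractions import Fraction

def bayes_action(p1, loss):
    """Action with minimal expected loss; loss[x][y], p1 = probability of x = 1.
    Ties are broken in favour of the smaller action index."""
    exp = [(1 - p1) * loss[0][y] + p1 * loss[1][y] for y in range(len(loss[0]))]
    return min(range(len(exp)), key=lambda y: exp[y])

def gamma(loss):
    """Threshold (l01 - l00) / (l01 - l00 + l10 - l11)."""
    return Fraction(loss[0][1] - loss[0][0],
                    loss[0][1] - loss[0][0] + loss[1][0] - loss[1][1])

def thresholded_action(p1, loss):
    """Round p1 to a 0/1-valued prediction at gamma, then act optimally for it."""
    omega = 1 if p1 > gamma(loss) else 0
    return bayes_action(Fraction(omega), loss)

# hypothetical asymmetric loss: l00 = 0, l01 = 1, l10 = 3, l11 = 0, so gamma = 1/4
EXAMPLE_LOSS = [[0, 1], [3, 0]]
agree = all(
    bayes_action(Fraction(k, 20), EXAMPLE_LOSS) == thresholded_action(Fraction(k, 20), EXAMPLE_LOSS)
    for k in range(21)
)
```

With two actions only the side of $\gamma$ on which the probability lies matters, which is why a posterior that converges to 0 or 1 can still yield the correct decisions.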

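Consider now $y\in\mathcal{Y} = \{0,1,2\}$. To prove the first line of ¬(vii) we define a loss function such that $y_t^{\Lambda_\mu} \neq y_t^{\Lambda_\rho}$ for any $\rho$ with the same range as $m_{norm}$ and for some $\mu$. The loss function $\ell_{x0} = x$, $\ell_{x1} = \frac38$, $\ell_{x2} = \frac23(1-x)$, and $\mu := \mu(1|x_{<t}) = \frac25$ will do. The $\rho$-expected loss under action $y$ is $l_\rho^y := \sum_{x_t=0}^1 \rho(x_t|x_{<t})\ell_{x_t y}$; $l_\rho^0 = \rho$, $l_\rho^1 = \frac38$, $l_\rho^2 = \frac23(1-\rho)$ with $\rho := \rho(1|x_{<t})$ (see Figure). Since $l_\mu^0 = l_\mu^2 = \frac25 > \frac38 = l_\mu^1$, we have $y_t^{\Lambda_\mu} = 1$ and $l_t^{\Lambda_\mu} = l_\mu^1 = \frac38$. For $\rho \le \frac13$, we have $l_\rho^0 < l_\rho^1 < l_\rho^2$, hence $y_t^{\Lambda_\rho} = 0$ and $l_t^{\Lambda_\rho} = l_\mu^0 = \frac25$. For $\rho \ge \frac12$, we have $l_\rho^2 < l_\rho^1 < l_\rho^0$, hence $y_t^{\Lambda_\rho} = 2$ and $l_t^{\Lambda_\rho} = l_\mu^2 = \frac25$. Since $m_{norm} \notin (\frac13,\frac12)$, $\Lambda_{m_{norm}}$ predicts 0 or 2, hence $l_t^{\Lambda_{m_{norm}}} = \frac25$. Since $\Lambda_{m_{norm}} = \Lambda_m$, this shows that $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = \frac{16}{15} > 1$. The constant $\frac{16}{15}$ can be enlarged to $\frac65-\varepsilon$ by setting $\ell_{x1} = \frac13+\varepsilon$ instead of $\frac38$.

[Figure: the expected losses $l_\rho^0 = \rho$, $l_\rho^1 = \frac38$ and $l_\rho^2 = \frac23(1-\rho)$ as functions of $\rho$.]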

For $\mathcal{Y} = \{0,...,|\mathcal{Y}|-1\}$, $|\mathcal{Y}| > 3$, we extend the loss function by defining $\ell_{xy} = 1$ $\forall y \ge 3$, ensuring that actions $y > 2$ are never favored. With this extension, the analysis of the $|\mathcal{Y}| = 3$ case applies, which finally shows ¬(vii). In general, a non-dense range of $\rho(x_t|x_{<t})$ implies $l_t^{\Lambda_\rho} \not\to l_t^{\Lambda_\mu}$, provided $|\mathcal{Y}| \ge 3$.
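The three-action counterexample can be verified with exact arithmetic. The following sketch encodes the loss $\ell_{x0}=x$, $\ell_{x1}=\frac38$, $\ell_{x2}=\frac23(1-x)$ and $\mu(1|x_{<t})=\frac25$ from the text; the function names are hypothetical, and the range of $m_{norm}$ is sampled only for exponents below 8.

```python
from fractions import Fraction as F

# LOSS[x][y] for outcome x in {0, 1} and action y in {0, 1, 2}
LOSS = [[F(0), F(3, 8), F(2, 3)],
        [F(1), F(3, 8), F(0)]]

def expected_losses(p1):
    """Expected loss of each action when outcome 1 has probability p1."""
    return [(1 - p1) * LOSS[0][y] + p1 * LOSS[1][y] for y in range(3)]

def action(p1):
    """Action minimizing the expected loss under p1 (smaller index on ties)."""
    e = expected_losses(p1)
    return min(range(3), key=lambda y: e[y])

def loss_ratio(rho1, mu1=F(2, 5)):
    """mu-expected loss of the rho-based action divided by that of the mu-based action."""
    e_mu = expected_losses(mu1)
    return e_mu[action(rho1)] / e_mu[action(mu1)]

# sample of the range of m_norm: 2^n / (2^n + 2^m) = 1 / (1 + 2^(m-n))
ratios = {loss_ratio(F(2 ** n, 2 ** n + 2 ** m)) for n in range(8) for m in range(8)}
```

Every sampled value of $m_{norm}$ leads to action 0 or 2, so `ratios` contains the single value 16/15, whereas any $\rho$ strictly between $\frac13$ and $\frac12$ that picks action 1 would give ratio 1.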

We now construct a monotone universal Turing machine $U$ satisfying ¬(vii) (second line). In case where ambiguities in the choice of $y$ in $\arg\min_y \ell_{xy}$ matter we consider the set of solutions $\{\arg\min_y \ell_{xy}\} := \{y : \ell_{xy} = \min_{y'}\ell_{xy'}\} \neq \{\}$. We define a one-to-one (onto $A$) decoding function $d : \{0,1\}^s \to A$ with $A = \{0^{s+1}\}\cup 1\{0,1\}^s\setminus 1\{0^s\} \subset \mathcal{X}^{s+1}$ as $d(0_{1:s}) = 0_{1:s+1}$ and $d(x_{1:s}) = 1x_{1:s}$ for $x_{1:s} \neq 0_{1:s}$ with a large $s\in\mathbb{N}$ to be determined later. We extend $d$ to $d : (\{0,1\}^s)^* \to A^*$ by defining $d(z_1...z_k) = d(z_1)...d(z_k)$ for $z_i\in\{0,1\}^s$ and define the inverse coding function $c : A \to \{0,1\}^s$ and its extension $c : A^* \to (\{0,1\}^s)^*$ by $c = d^{-1}$. Roughly, $U$ is defined as $U(1p_{1:sn}0_{1:s}) = d(p_{1:sn})0_{1:s+1}$. More precisely, if the first bit of the binary input tape of $U$ contains 1, $U$ decodes the successive blocks of size $s$, but always withholds the output until a block $0_{1:s}$ appears. $U$ is obviously monotone. Universality will be guaranteed by defining $U(0p)$ appropriately, but for the moment we set $U(0p) = \epsilon$. It is easy to see that for $x\in A^*$ we have

$$Km(x0) = Km(x0_{1:s+1}) = \ell(c(x)) + s + 1 \quad\text{and}\quad Km(x1) = Km(x1z0_{1:s+1}) = \ell(c(x)) + 2s + 1, \qquad (10)$$

where $z$ is any string of length $s$. Hence, $m_{norm}(0|x) = [1+2^{-s}]^{-1} \stackrel{s\to\infty}{\longrightarrow} 1$ and $m_{norm}(1|x) = [1+2^{s}]^{-1} \stackrel{s\to\infty}{\longrightarrow} 0$. For $t-1\in(s+1)\mathbb{N}$ we get $l_{m_{norm}}^{y_t} := \sum_{x_t} m_{norm}(x_t|x_{<t})\ell_{x_t y_t} \stackrel{s\to\infty}{\longrightarrow} \ell_{0y_t}$. This implies

$$y_t^{\Lambda_m} \in \{\arg\min_y l_{m_{norm}}^y\} \subseteq \{\arg\min_y \ell_{0y}\} \quad\text{for sufficiently large finite } s. \qquad (11)$$
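The block code $d$, its inverse $c$, and the resulting values of $m_{norm}$ can be made concrete for small $s$. This is only a sketch: strings over {0,1} are represented as Python `str`, the function names are hypothetical, and `m_norm` simply plugs the two complexity expressions (10) into $m = 2^{-Km}$ rather than simulating the machine $U$.

```python
from fractions import Fraction
from itertools import product

def d_block(z):
    """Decode one block z in {0,1}^s to a word of A (length s+1)."""
    s = len(z)
    return "0" * (s + 1) if z == "0" * s else "1" + z

def c_block(a):
    """Inverse of d_block, defined on A only."""
    s = len(a) - 1
    if a == "0" * (s + 1):
        return "0" * s
    if a[0] == "1" and a[1:] != "0" * s:
        return a[1:]
    raise ValueError("word is not in A")

def d(p, s):
    """Extension of d_block to concatenations of blocks of size s."""
    assert len(p) % s == 0
    return "".join(d_block(p[i:i + s]) for i in range(0, len(p), s))

def m_norm(s):
    """(m_norm(0|x), m_norm(1|x)) for x in A*, using Km(x0) = l(c(x)) + s + 1
    and Km(x1) = l(c(x)) + 2s + 1; the common factor 2^-l(c(x)) cancels."""
    m0, m1 = Fraction(1, 2 ** (s + 1)), Fraction(1, 2 ** (2 * s + 1))
    return m0 / (m0 + m1), m1 / (m0 + m1)

S = 3
BLOCKS = ["".join(b) for b in product("01", repeat=S)]
IMAGES = [d_block(z) for z in BLOCKS]
```

For $s=3$ this gives $m_{norm}(0|x)=\frac89$ and $m_{norm}(1|x)=\frac19$, in agreement with $[1+2^{-s}]^{-1}$ and $[1+2^{s}]^{-1}$.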

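We now define $\mu(z) = |A|^{-1} = 2^{-s}$ for $z\in A$ and $\mu(z) = 0$ for $z\in\mathcal{X}^{s+1}\setminus A$, extend it to $\mu(z_1...z_k) := \mu(z_1)\cdot...\cdot\mu(z_k)$ for $z_i\in\mathcal{X}^{s+1}$, and finally extend it uniquely to a measure on $\mathcal{X}^*$ by $\mu(x_{<t}) := \sum_{x_{t:n}}\mu(x_{1:n})$ for $\mathbb{N}\ni t\le n\in(s+1)\mathbb{N}$. For $x\in A^*$ we have $\mu(0|x) = \mu(0) = \mu(0_{1:s+1}) = 2^{-s} \stackrel{s\to\infty}{\longrightarrow} 0$ and $\mu(1|x) = \mu(1) = \sum_{y\in\mathcal{X}^s}\mu(1y) = \sum_{z\in A\setminus\{0^{s+1}\}}\mu(z) = (2^s-1)\cdot 2^{-s} = 1-2^{-s} \stackrel{s\to\infty}{\longrightarrow} 1$. For $t-1\in(s+1)\mathbb{N}$ we get $l_\mu^{y_t} := \sum_{x_t}\mu(x_t|x_{<t})\ell_{x_t y_t} \stackrel{s\to\infty}{\longrightarrow} \ell_{1y_t}$. This implies

$$y_t^{\Lambda_\mu} \in \{\arg\min_y l_\mu^y\} \subseteq \{\arg\min_y \ell_{1y}\} \quad\text{for sufficiently large finite } s. \qquad (12)$$

By definition, $\ell$ is non-degenerate iff $\{\arg\min_y \ell_{0y}\}\cap\{\arg\min_y \ell_{1y}\} = \{\}$. This, together with (11) and (12), implies $y_t^{\Lambda_m} \neq y_t^{\Lambda_\mu}$, which implies $l_t^{\Lambda_m} \neq l_t^{\Lambda_\mu}$ (otherwise the choice $y_t^{\Lambda_\mu} = y_t^{\Lambda_m}$ would have been possible), which implies $l_t^{\Lambda_m}/l_t^{\Lambda_\mu} = c > 1$ for $t-1\in(s+1)\mathbb{N}$, i.e. for infinitely many $t$.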
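The disagreement between the two predictors at block boundaries can be checked by enumerating $A$ for a small $s$. In this sketch the helper names are hypothetical, the 0/1 error loss serves as one example of a non-degenerate loss, and $m_{norm}(1|x)=[1+2^s]^{-1}$ is taken from the preceding computation rather than derived from a simulated machine.

```python
from fractions import Fraction
from itertools import product

def alphabet_A(s):
    """A = {0^(s+1)} together with all words 1z, z in {0,1}^s, z != 0^s."""
    words = ["0" * (s + 1)]
    words += ["1" + "".join(b) for b in product("01", repeat=s) if "1" in b]
    return words

def mu_first_bit(s):
    """(mu(0|x), mu(1|x)) at a block boundary, with mu uniform on A."""
    A = alphabet_A(s)
    w = Fraction(1, len(A))
    p0 = sum(w for a in A if a[0] == "0")
    return p0, 1 - p0

def best_actions(p1, loss):
    """Set of actions minimizing the expected loss when outcome 1 has probability p1."""
    exp = [(1 - p1) * loss[0][y] + p1 * loss[1][y] for y in range(len(loss[0]))]
    lo = min(exp)
    return {y for y, e in enumerate(exp) if e == lo}

S = 4
ERROR_LOSS = [[0, 1], [1, 0]]          # loss[x][y] = 1 iff y != x
mu0, mu1 = mu_first_bit(S)
m_norm1 = Fraction(1, 1 + 2 ** S)      # m_norm(1|x) = [1 + 2^s]^-1
actions_mu = best_actions(mu1, ERROR_LOSS)
actions_m = best_actions(m_norm1, ERROR_LOSS)
```

For $s=4$, $\mu$ puts probability $\frac{15}{16}$ on the next bit being 1 while $m_{norm}$ puts $\frac{16}{17}$ on it being 0, so the two optimal action sets are disjoint.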

What remains to do is to extend $U$ to a universal Turing machine. We extend $U$ by defining $U(0zp) = U'(p)$ for any $z\in\{0,1\}^{3s}$, where $U'$ is some universal Turing machine. Clearly, $U$ is now universal. We have to show that this extension does not spoil the preceding consideration, i.e. that the shortest code of $x$ has sufficiently often the form $1p$ and sufficiently seldom the form $0p$. Above, $\mu$ has been chosen in such a way that $c(x)$ is a Shannon-Fano code for $\mu$-distributed strings, i.e. $c(x)$ is with high $\mu$-probability a shortest code of $x$. More precisely, $\ell(c(x)) \le Km_T(x) + s$ with $\mu$-probability at least $1-2^{-s}$, where $Km_T$ is the monotone complexity w.r.t. any decoder $T$, especially $T = U'$. This implies $\min_p\{\ell(0p) : U(0p) = x*\} = 3s+1+Km_{U'}(x) \ge 3s+1+\ell(c(x))-s > \ell(c(x))+s+1 \ge \min_p\{\ell(1p) : U(1p) = x*\}$, where the first $\ge$ holds with high probability ($1-2^{-s}$). This shows that the expressions (10) for $Km$ are with high probability not affected by the extension of $U$. Altogether this



Speed of off-sequence convergence of m for computable environments. Probably the most interesting open question is how fast $m(\bar x_t|x_{<t})$ converges to zero in the deterministic case.

Non-self-optimizingness for general U and ℓ. Another open problem is whether self-optimizingness of $\Lambda_m$ can be violated for every non-degenerate loss function. We have shown that this is the case for particular choices of the universal Turing machine $U$. If $\Lambda_m$ were self-optimizing for some $U$ and general loss, this would be an unusual situation in Algorithmic Information Theory, where properties typically hold for all or no $U$. So we expect $\Lambda_m$ not to be self-optimizing for general loss and $U$ (particular $\mu$ of course). A first step may be to try to prove that for all $U$ there exists a computable sequence $x_{1:\infty}$ such that $K_U(x_{<t}\bar x_t) < K_U(x_{<t}x_t)$ for infinitely many $t$ (which shows ¬(vii) for $K$ and error-loss), and then try to generalize to probabilistic $\mu$, $Km$, and general loss functions.

Other complexity measures. This work analyzed the predictive properties of the monotone complexity $Km$. This choice was motivated by the fact that $m$ is the MDL approximation of the sum $M$, and $Km$ is very close to $KM$. We expect all other (reasonable) alternative complexity measures to perform worse than $Km$. But we should be careful with premature conclusions: since closeness of unconditional predictive functions does not necessarily imply good prediction performance, distantness may not necessarily imply poor performance either. What is easy to see is that $K(x)$ (and $K(x|\ell(x))$) are completely unsuitable for prediction, since $K(x0) \stackrel{+}{=} K(x1)$ (and $K(x0|\ell(x0)) \stackrel{+}{=} K(x1|\ell(x1))$), which implies that the predictive functions do not even converge for deterministic computable environments. Note that the larger a semimeasure, the more distributions it dominates, and the better its predictive properties. This simple rule does not hold for non-semimeasures. Although $M$ predicts better than $m$ predicts better than $k$ in accordance with (8), $2^{-K(x|\ell(x))} \stackrel{\times}{\geq} M(x)$ is a predictor disaccording with (JHJ). Besides the discussed prefix Kolmogorov complexity $K$, monotone complexity $Km$, and Solomonoff's universal prior $M = 2^{-KM}$, one may investigate the predictive properties of the historically first plain Kolmogorov complexity $C$, Schnorr's process complexity, Chaitin's complexity $Kc$, Cover's extension semimeasure $Mc$, Loveland's uniform complexity, Schmidhuber's cumulative $K^E$ and general $K^G$ complexity and corresponding measures, Vovk's predictive complexity $KP$, Schmidhuber's speed prior $S$, Levin complexity $Kt$, and several others [LV97, VW98, Sch00]. Many properties and relations are known for the unconditional versions, but little that is relevant for prediction is known about the conditional versions.

Two-part MDL. We have approximated $M(x) := \sum_{p:U(p)=x*} 2^{-\ell(p)}$ by its dominant contribution $m(x) = 2^{-Km(x)}$, which we have interpreted as deterministic or one-part universal MDL. There is another representation of $M$ due to Levin [ZL70] as a mixture over semi-measures: $M(x) = \sum_{\nu\in\mathcal{M}_{enum}^{semi}} 2^{-K(\nu)}\nu(x)$ with dominant contribution $m_2(x) = 2^{-Km_2(x)}$ and universal two-part MDL $Km_2(x) := \min_{\nu\in\mathcal{M}_{enum}^{semi}}\{-\log\nu(x) + K(\nu)\}$. MDL "lives" from the validity of this approximation. $K(\nu)$ is the complexity of the probabilistic model $\nu$, and $-\log\nu(x)$ is the (Shannon-Fano) description length of data $x$ in model $\nu$. MDL usually refers to two-part MDL, and not to one-part MDL. A natural question is to ask about the predictive properties of $m_2$, similarly to $m$. $m_2$ is even closer to $M$ than $m$ is ($m_2 \stackrel{\times}{=} M$), but is also not a semi-measure. Drawing the analogy to $m$ further, we conjecture slow posterior convergence $m_2\to\mu$ w.p.1 for computable probabilistic environments $\mu$. In [BC91], MDL has been shown to converge for computable i.i.d. environments.
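The relation between a mixture and its dominant two-part MDL term can be illustrated on a toy model class. The sketch below is not the universal construction: $K(\nu)$ is not computable, so a small hypothetical class of Bernoulli models with hand-assigned code lengths (satisfying the Kraft inequality) stands in for the class of enumerable semimeasures, and all names are illustrative.

```python
import math

# hypothetical class: Bernoulli(theta) models with made-up code lengths in place of K(nu)
MODELS = {0.5: 1, 0.25: 3, 0.75: 3, 0.1: 4, 0.9: 4}

def neg_log_lik(x, theta):
    """-log2 nu(x) for an i.i.d. Bernoulli(theta) model and a bit string x."""
    ones = x.count("1")
    zeros = len(x) - ones
    return -(ones * math.log2(theta) + zeros * math.log2(1 - theta))

def two_part_mdl(x):
    """Return (min over models of -log2 nu(x) + code length, minimizing theta)."""
    return min((neg_log_lik(x, th) + k, th) for th, k in MODELS.items())

def mixture(x):
    """Sum over the class of 2^-codelength * nu(x)."""
    return sum(2.0 ** (-k - neg_log_lik(x, th)) for th, k in MODELS.items())

x = "1110111011"
km2, best_theta = two_part_mdl(x)
m2 = 2.0 ** (-km2)
```

In this finite toy class the mixture lies between its largest term and the number of models times that term, i.e. `m2 <= mixture(x) <= len(MODELS) * m2`; for the universal class the text states the corresponding relation up to a multiplicative constant.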

More abstract proofs showing that violation of some of the criteria (i)-(iv) necessarily leads to violation of (vi) or (vii) may deal with a number of complexity measures simultaneously. For instance, we have seen that any non-dense posterior set $\{k(x_t|x_{<t})\}$ implies non-convergence and non-self-optimizingness; the particular structure of $m$ did not matter.

Extra conditions. Non-convergence or non-self-optimizingness of $m$ does not necessarily mean that $m$ fails in practice. Often one knows more than that the environment is (probabilistically) computable, or the environment possesses certain additional properties, even if unknown. So one should find sufficient and/or necessary extra conditions on $\mu$ under which $m$ converges / $\Lambda_m$ self-optimizes rapidly. The results of this work have shown that for $m$-based prediction one has to make extra assumptions (as compared to $M$). It would be interesting to characterize the class of environments for which universal MDL alias $m$ is a good predictive approximation to $M$. Deterministic computable environments were such a class, but a rather small one, and convergence is possibly slow.